Probabilistic Noise Identification and Data Cleaning

Authors

  • Jeremy Kubica
  • Andrew W. Moore
Abstract

Real-world data is never as perfect as we would like it to be and can often suffer from corruptions that may impact interpretations of the data, models created from the data, and decisions made based on the data. One approach to this problem is to identify and remove records that contain corruptions. Unfortunately, if only certain fields in a record have been corrupted, then usable, uncorrupted data will be lost. In this paper we present LENS, an approach for identifying corrupted fields and using the remaining noncorrupted fields for subsequent modeling and analysis. Our approach uses the data to learn a probabilistic model containing three components: a generative model of the clean records, a generative model of the noise values, and a probabilistic model of the corruption process. We provide an algorithm for the unsupervised discovery of such models and empirically evaluate both its performance at detecting corrupted fields and, as one example application, the resulting improvement this gives to a classifier.
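
The three-component structure described in the abstract can be illustrated with a small sketch. This is not the LENS algorithm itself: the abstract does not specify the model families or the learning procedure, so the sketch assumes fixed (not learned) parameters, independent fields, a Gaussian clean model, a broad Gaussian noise model, and a single corruption probability `p_corrupt`. The function names and all parameter values are hypothetical. It only shows how the three components could combine, via Bayes' rule, to score each field of a record rather than discarding the whole record.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def corruption_posteriors(record, clean_params, noise_params, p_corrupt):
    """Return P(field is corrupted | observed value) for each field.

    Assumptions (illustrative only, not taken from the paper):
      - fields are independent of one another,
      - clean values follow a per-field Gaussian (clean_params),
      - noise values follow a per-field broad Gaussian (noise_params),
      - each field is corrupted independently with probability p_corrupt.
    """
    posteriors = []
    for x, (c_mean, c_std), (n_mean, n_std) in zip(record, clean_params, noise_params):
        like_clean = (1.0 - p_corrupt) * gaussian_pdf(x, c_mean, c_std)
        like_noise = p_corrupt * gaussian_pdf(x, n_mean, n_std)
        posteriors.append(like_noise / (like_clean + like_noise))
    return posteriors

# Hypothetical two-field record whose second field is far from the clean model.
clean_params = [(0.0, 1.0), (5.0, 1.0)]    # (mean, std) of clean values per field
noise_params = [(0.0, 20.0), (0.0, 20.0)]  # (mean, std) of noise values per field
record = [0.3, 40.0]

post = corruption_posteriors(record, clean_params, noise_params, p_corrupt=0.05)

# Keep only the fields judged more likely clean than corrupted.
usable = [x for x, p in zip(record, post) if p < 0.5]
```

In this toy setting the first field is kept and the second is flagged, so the record still contributes its usable value to downstream modeling instead of being dropped entirely.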

Similar resources

Semantics Representation of Probabilistic Data by Using Topk-Queries for Uncertain Data

Database systems for uncertain and probabilistic data promise to have many applications. Query processing on uncertain data occurs in the contexts of data warehousing, data integration, and of processing data extracted from the Web. Data cleaning can be fruitfully approached as a problem of reducing uncertainty in data and requires the management and processing of large amounts of uncertain dat...

Probabilistic Contaminant Source Identification in Water Distribution Infrastructure Systems

Large water distribution systems can be highly vulnerable to penetration of contaminant factors caused by different means including deliberate contamination injections. As contaminants quickly spread into a water distribution network, rapid characterization of the pollution source has a high measure of importance for early warning assessment and disaster management. In this paper, a methodology...

A Robust Structural Fingerprint Restoration

Fast and accurate ridge detection in fingerprints is essential to each AFIS (Automatic Fingerprint Identification System). Smudged furrows and cut ridges in the image of a fingerprint are major problems in any AFIS. This paper investigates a new online ridge detection method that reduces the complexity and costs associated with the fingerprint identification procedure. The noise in fingerprint...

Bayesian Data Cleaning for Web Data

Data Cleaning is a long standing problem, which is growing in importance with the mass of uncurated web data. State of the art approaches for handling inconsistent data are systems that learn and use conditional functional dependencies (CFDs) to rectify data. These methods learn data patterns–CFDs–from a clean sample of the data and use them to rectify the dirty/inconsistent data. While getting...

A Formal Framework For Probabilistic Unclean Databases

Traditional modeling of inconsistency in database theory casts all possible “repairs” equally likely. Yet, effective data cleaning needs to incorporate statistical reasoning. For example, yearly salary of $100k and age of 22 are more likely than $100k and 122 and two people with same address are likely to share their last name (i.e., a functional dependency tends to hold but may occasionally be...

Publication date: 2003